Case Study 3: Fairness Testing for Algorithmic Pricing

A practical framework for fair algorithmic pricing

Author

Fei Huang

Published

June 11, 2026

Before you start

Read Step 4: Audit the system first for the qualitative audit protocol. This case study implements CDP and proxy-discrimination testing on Illinois auto insurance quotes.

Background

Regulatory agencies have increasingly required firms to assess whether algorithmic pricing systems produce unfair outcomes for protected groups. For example, recent regulatory proposals in Colorado and New York (Colorado Division of Insurance 2023; New York State Department of Financial Services 2024) require insurers to evaluate whether pricing algorithms and predictive models produce discriminatory pricing outcomes.

Many fairness audits rely on regression-based methods. A common approach is to compare pricing outcomes across protected groups after controlling for legitimate risk factors and then evaluate whether any remaining disparity falls within an acceptable tolerance.

This case study illustrates how regression-based fairness audits can be conducted for deterministic pricing outputs. It is based on Huang and Hooker (2026). In particular, we focus on two fairness criteria that are closely related to current regulatory proposals:

  • Conditional demographic parity (CDP). after controlling for legitimate risk factors, do pricing outcomes differ systematically across protected groups?
  • Proxy discrimination (PD). does a rating variable act as a statistical proxy for a protected attribute?

A distinctive feature of algorithmic pricing systems is that they are often deterministic. The same profile submitted to the same algorithm produces the same price. As a result, residuals from an audit regression represent approximation error rather than sampling variability. This distinction has important implications for statistical inference and forms the central motivation for the empirical analysis in this case study.

To illustrate the audit framework, we use the ProPublica Illinois auto insurance dataset (Larson et al. 2017), which contains quoted premiums from multiple insurers across Illinois zip codes. Download the public data release, extract car-insurance-public/data/il-per-zip.csv, and place it in this case study folder before running the code. See Data for details.

The analysis is designed to show how to run a regression-based fairness audit, not to deliver filing-ready compliance conclusions for any named insurer.

How to Read This Case Study

Not every section is needed for every reader. The table below suggests a practical reading order.

If your focus is… Start here Then read
Audit workflow and regulatory logic Audit protocol CDP auditPD audit
What the Illinois results show Data preparation CDP audit and PD audit summary tables and figures
Implementation detail NotationCase Study Setup Supporting Materials and code chunks (folded by default)

Audit protocol

This case study applies the Plan → Audit → Decide → Improve workflow from Step 4: Audit the system. Design choices are fixed before examining outcomes. The sections below show how that protocol is implemented on Illinois auto quotes.

Audit flow: Plan, Audit, Decide, Improve with Pass, Insufficient Information, or Fail

For this illustration we use CDP for system-level price gaps (with TOST) and PD for the suspected proxy variable log_risk. Protected attribute: majority-minority zip-code indicator. Legitimate factors: log state risk and a Chicago indicator. Tolerances and decision rules are in Audit parameters. Pass, fail, and insufficient-information outcomes follow the three-outcome framework in Step 4.

For practitioners

This case study applies one illustrative audit design from Huang and Hooker (2026) to Illinois auto quotes. In your own market you must choose:

  • Fairness criterion (CDP, PD, or another rule required by law)
  • Protected attribute A (directly observed or responsibly proxied)
  • Legitimate rating factors X_\ell (as recognised by your regulator)
  • Tolerance margins (for example, Colorado’s 5% price-gap discussion and standard 0.80 adverse-impact ratio)
  • Suspected proxy variables for PD screens

Quoted premiums here are deterministic algorithm outputs, so classical OLS standard errors are generally not valid. Use the corrected estimators shown below. Results are a teaching illustration using zip-level data and a proxied protected attribute, not a definitive regulatory finding about any company.

Scope of this case study
  • ProPublica Illinois auto quotes, company-level audits
  • Protected attribute: majority-minority zip-code indicator (a geographic proxy, not observed race)
  • Legitimate factors: state_risk, chicago
  • CDP via log-premium regression + TOST. PD illustration for log_risk
  • Corrected inference (HC3. PD cross-covariance)

The majority-minority zip indicator is a geographic proxy. For BISG or BIFSG-based regulatory designs, see Step 4: Proxied protected attributes and Xin, Hooker, and Huang (2026) on how proxy measurement distorts regression-based disparity estimates.

Not covered here. GLM audit specifications, formal sample-size planning, multi-proxy screens, BISG/BIFSG inference, or insurer-specific filing conclusions. See Huang and Hooker (2026) for the audit protocol and Xin, Hooker, and Huang (2026) for proxy-race measurement limits.

Case Study Setup

Data and audit setting

The empirical analysis is conducted at the company level. For each insurer, we observe quoted premiums across Illinois zip codes. The goal is to illustrate how regression-based fairness audits can be applied to deterministic pricing outputs.

To reproduce the results locally, place il-per-zip.csv in this folder (see Data). The code below expects DATA_PATH = "il-per-zip.csv".

The key variables are:

  • combined_premium, annual combined liability premium
  • minority, indicator equal to 1 if the zip code is majority-minority
  • state_risk, aggregate loss cost per insured vehicle
  • chicago, indicator for Chicago zip codes
  • companies_name, insurer name

In the analysis below, each insurer is audited separately. The protected attribute is represented by the majority-minority zip-code indicator (A_z = 1), while state_risk and chicago are used as legitimate rating factors (X_\ell). Symbols are defined in Notation.

Audit parameters

The table below maps pre-specified audit parameters to their regulatory interpretation. Values are fixed in code before running company-level tests.

Parameter Code value Role
Significance level ALPHA = 0.05 90% confidence intervals for TOST (1 - 2\alpha)
Minimum zip codes per company MIN_OBS = 50 Exclude thin company samples from company-level tests
CDP dollar tolerance DELTA_PCT = 0.05 5% of mean premium (Colorado draft guidance)
CDP ratio tolerance TAU = 0.80 Standard 0.80 adverse-impact ratio band
PD substantive threshold RHO_MIN = 0.10 Flag only if relative coefficient shift \geq 10\%

Import packages and define audit parameters

The following code imports the required Python packages and sets the parameters above.

Show setup code
import sys
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
from IPython.display import display

DATA_PATH   = "il-per-zip.csv"
MIN_OBS     = 50
ALPHA       = 0.05
Z_CRIT      = stats.norm.ppf(1.0 - ALPHA)

# CDP TOST tolerance margins
DELTA_PCT   = 0.05
TAU         = 0.80
LOG_TAU     = np.log(1.0 / TAU)

# PD substantive shift threshold
RHO_MIN     = 0.10

Supporting functions

The following functions implement the calculations used throughout the CDP and PD audits. The code is folded for readability but can be expanded if desired.

Show supporting functions code
def ols_fit(X, y):
    """
    OLS estimator: beta_hat = (X'X)^(-1) X'y.
    """
    XtX_inv = np.linalg.inv(X.T @ X)
    beta    = XtX_inv @ (X.T @ y)
    resid   = y - X @ beta
    return beta, resid, XtX_inv


def cov_classical(resid, XtX_inv, n, p):
    """
    Classical OLS covariance: sigma^2 * (X'X)^(-1).
    """
    sigma2 = float(resid @ resid) / (n - p)
    return sigma2 * XtX_inv, sigma2


def cov_hc3(X, resid, XtX_inv):
    """
    HC3 sandwich covariance estimator.
    """
    h    = np.einsum('ij,jk,ik->i', X, XtX_inv, X)
    w    = resid / (1.0 - h)
    meat = (X * w[:, np.newaxis]).T @ (X * w[:, np.newaxis])
    return XtX_inv @ meat @ XtX_inv


def cov_shift_full(X_res, X_ext, fz, j=1, k=1):
    """
    Correct variance of the coefficient shift for proxy discrimination.

    X_res : restricted model matrix, excluding the protected attribute.
    X_ext : extended model matrix, including the protected attribute.
    fz    : deterministic response vector.
    j, k  : column indices for the suspected proxy variable in the two models.
    """
    n = len(fz)

    XtX_inv_res = np.linalg.inv(X_res.T @ X_res)
    XtX_inv_ext = np.linalg.inv(X_ext.T @ X_ext)

    sc_res = X_res * fz[:, np.newaxis]
    sc_ext = X_ext * fz[:, np.newaxis]

    meat_res  = (sc_res.T @ sc_res) / n
    meat_res -= np.outer(sc_res.mean(axis=0), sc_res.mean(axis=0))
    sw_res    = XtX_inv_res @ meat_res @ XtX_inv_res
    var_res   = sw_res[j, j]

    meat_ext  = (sc_ext.T @ sc_ext) / n
    meat_ext -= np.outer(sc_ext.mean(axis=0), sc_ext.mean(axis=0))
    sw_ext    = XtX_inv_ext @ meat_ext @ XtX_inv_ext
    var_ext   = sw_ext[k, k]

    var_ind = var_res + var_ext

    cross_meat  = (sc_res.T @ sc_ext) / n
    cross_meat -= np.outer(sc_res.mean(axis=0), sc_ext.mean(axis=0))
    cross       = float(XtX_inv_res[j, :] @ cross_meat @ XtX_inv_ext[:, k])

    var_full = max(var_res + var_ext - 2.0 * cross, 0.0)

    return var_full, var_ind


def tost_verdict(beta_A, se_hc3, mean_premium, alpha=ALPHA,
                 delta_pct=DELTA_PCT, tau=TAU):
    """
    Three-outcome TOST decision for CDP: Pass, Fail, or
    Insufficient Information.
    """
    z = stats.norm.ppf(1.0 - alpha)
    ci_lo = beta_A - z * se_hc3
    ci_hi = beta_A + z * se_hc3

    log_tau = np.log(1.0 / tau)
    delta = delta_pct * mean_premium

    ratio_inside = (ci_lo > -log_tau) and (ci_hi < log_tau)
    ratio_outside = (ci_lo >= log_tau) or (ci_hi <= -log_tau)

    gap_lo = mean_premium * (np.exp(ci_lo) - 1.0)
    gap_hi = mean_premium * (np.exp(ci_hi) - 1.0)

    level_inside = (gap_lo > -delta) and (gap_hi < delta)
    level_outside = (gap_lo >= delta) or (gap_hi <= -delta)

    if ratio_inside and level_inside:
        verdict = "PASS"
    elif ratio_outside or level_outside:
        verdict = "FAIL"
    else:
        verdict = "INSUFFICIENT INFORMATION"

    return verdict, ci_lo, ci_hi, gap_lo, gap_hi


def pd_verdict(shift, phi_res, z_full, z_crit=Z_CRIT, rho_min=RHO_MIN):
    """
    Two-part proxy-discrimination decision rule.
    """
    rel_pct = abs(shift / phi_res) * 100.0 if phi_res != 0 else np.nan
    flag    = (abs(z_full) > z_crit) and (rel_pct >= rho_min * 100.0)
    return flag, rel_pct

Data preparation

Show data loading code
def prepare_data(path):
    df = pd.read_csv(path)

    required = {
        "companies_name", "minority", "state_risk",
        "combined_premium", "chicago"
    }
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns in CSV: {missing}")

    df = df.copy()
    df["minority_flag"] = df["minority"].astype(int)
    df["log_risk"] = np.log(df["state_risk"])
    df["log_premium"] = np.log(df["combined_premium"])
    df["excess_premium"] = df["combined_premium"] / df["state_risk"]
    df["chicago"] = df["chicago"].astype(float)

    return df


df = prepare_data(DATA_PATH)

print(f"Number of observations: {df.shape[0]}")
print(f"Number of variables: {df.shape[1]}")

df.head()
Number of observations: 31382
Number of variables: 15
zipcode chicago minority companies_name name bi_policy_premium pd_policy_premium state_risk white_non_hisp_pct risk_difference combined_premium minority_flag log_risk log_premium excess_premium
0 60002 0.0 False State Farm Fire & Cas Co STATE FARM GRP 414 0.0 200.692503 88.6 213.307497 414.0 0 5.301774 6.025866 2.062857
1 60002 0.0 False Metropolitan Grp Prop & Cas Ins Co METROPOLITAN GRP 321 244.0 200.692503 88.6 364.307497 565.0 0 5.301774 6.336826 2.815252
2 60002 0.0 False Progressive Direct Ins Co PROGRESSIVE GRP 360 176.0 200.692503 88.6 335.307497 536.0 0 5.301774 6.284134 2.670752
3 60002 0.0 False American Family Mut Ins Co AMERICAN FAMILY INS GRP 458 0.0 200.692503 88.6 257.307497 458.0 0 5.301774 6.126869 2.282098
4 60002 0.0 False Garrison Prop & Cas Ins Co UNITED SERV AUTOMOBILE ASSN GRP 171 171.0 200.692503 88.6 141.307497 342.0 0 5.301774 5.834811 1.704100

The dataset contains zip-code-company observations. Before running company-level audits, we first summarise the unconditional differences between majority-minority and non-majority-minority zip codes.

Show summary statistics code
def table_summary_stats(df):
    non_mm = df[df["minority_flag"] == 0]
    mm = df[df["minority_flag"] == 1]

    rows = []
    for var, label, scale in [
        ("combined_premium", "Combined premium ($)", 1),
        ("state_risk", "State risk ($)", 1),
        ("excess_premium", "Excess premium ($)", 1),
        ("minority", "Pct minority (%)", 100),
    ]:
        full = df[var] * scale
        non_mm_var = non_mm[var] * scale
        mm_var = mm[var] * scale

        rows.append({
            "Variable": label,
            "Mean": round(full.mean(), 1),
            "SD": round(full.std(), 1),
            "Max": round(full.max(), 1),
            "Non-majority-minority zips": round(non_mm_var.mean(), 1),
            "Majority-minority zips": round(mm_var.mean(), 1),
            "Ratio": round(mm_var.mean() / non_mm_var.mean(), 3)
        })

    return pd.DataFrame(rows)

summary_stats = table_summary_stats(df)
display(summary_stats)
Variable Mean SD Max Non-majority-minority zips Majority-minority zips Ratio
0 Combined premium ($) 370.2 147.7 1345.0 355.9 482.2 1.355
1 State risk ($) 162.7 50.9 297.5 160.6 179.2 1.116
2 Excess premium ($) 2.5 1.3 15.0 2.5 2.8 1.121
3 Pct minority (%) 11.3 31.6 100.0 0.0 100.0 inf
Note

The unconditional premium difference is not itself a fairness-audit conclusion, because majority-minority and non-majority-minority zip codes may differ in risk and geography. The regression audit below controls for state risk and a Chicago indicator.

Notation

Symbol Meaning Variable in this case study
P_{kz} Quoted premium for company k in zip z combined_premium
A_z Protected attribute minority_flag (1 = majority-minority zip)
X_{\ell,z} Legitimate rating factors log_risk, chicago
\beta_{A,k} Conditional log-premium gap on A for company k CDP regression coefficient
\Delta_\mu Conditional mean premium gap Implied from \beta_{A,k} at mean premium
R_\mu Conditional premium ratio across groups \exp(\beta_{A,k})
\phi, \phi' Coefficient on suspected proxy in restricted / extended PD models Coefficient on log_risk
\Delta_{PD} Proxy coefficient shift \phi - \phi' PD test statistic
HC3 SE Corrected standard error for deterministic outputs se_hc3
\delta Dollar tolerance for CDP DELTA_PCT \times mean premium
\tau Adverse-impact ratio tolerance TAU = 0.80
\rho_{\min} Minimum relative PD shift to flag RHO_MIN = 0.10

Rule abbreviations. CDP = conditional demographic parity. PD = proxy discrimination. TOST = two one-sided tests (equivalence testing).

Conditional Demographic Parity Audit

Audit model

The CDP audit evaluates whether pricing outcomes differ across protected groups after controlling for legitimate rating factors. For each company k, we fit the audit regression (see Notation)

\log(P_{kz}) = \mu_{0k} + \beta_{A,k} A_z + \gamma_{1k} \log(\text{state risk}_z) + \gamma_{2k} \text{Chicago}_z + r_{kz},

where P_{kz} is the quoted premium for company k in zip code z, and A_z is the majority-minority zip-code indicator. The coefficient \beta_{A,k} measures the conditional log-premium gap for company k.

Show design matrix code
def build_design_matrices(grp):
    n = len(grp)
    ones = np.ones(n)
    A = grp["minority_flag"].values.astype(float)
    lr = grp["log_risk"].values
    C = grp["chicago"].values

    X_cdp = np.column_stack([ones, A, lr, C])
    X_res = np.column_stack([ones, lr, C])
    X_ext = np.column_stack([ones, lr, C, A])
    fz = grp["log_premium"].values

    return X_cdp, X_res, X_ext, fz, n

Classical versus HC3 standard errors

The first issue is whether the usual classical OLS standard error is appropriate. Since quoted premiums are deterministic outputs, the residuals are approximation errors rather than independent stochastic noise. We therefore compare the usual classical standard error with the HC3 sandwich standard error. HC3 is a finite-sample heteroskedasticity-consistent sandwich estimator proposed by MacKinnon and White (1985) and used here as the default corrected standard error.

Show CDP analysis code
def analyse_cdp_company(company, grp):
    X_cdp, X_res, X_ext, fz, n = build_design_matrices(grp)
    p_cdp = X_cdp.shape[1]

    # CDP regression
    beta_cdp, resid_cdp, XtX_inv_cdp = ols_fit(X_cdp, fz)
    cov_cl, sigma2 = cov_classical(resid_cdp, XtX_inv_cdp, n, p_cdp)
    cov_hc = cov_hc3(X_cdp, resid_cdp, XtX_inv_cdp)

    beta_A = float(beta_cdp[1])
    se_cl = float(np.sqrt(cov_cl[1, 1]))
    se_hc3 = float(np.sqrt(cov_hc[1, 1]))
    rho = se_hc3 / se_cl

    ss_tot = float(np.sum((fz - fz.mean()) ** 2))
    r2 = 1.0 - float(resid_cdp @ resid_cdp) / ss_tot

    mean_prem_log = fz.mean()
    mean_prem_usd = float(np.exp(mean_prem_log))

    verdict, ci_lo, ci_hi, gap_lo, gap_hi = tost_verdict(
        beta_A, se_hc3, mean_prem_usd
    )

    ratio = float(np.exp(beta_A))
    gap_dollars = float(mean_prem_usd * (np.exp(beta_A) - 1.0))

    cdp_rec = {
        "company": company,
        "n": n,
        "beta_A": beta_A,
        "se_classical": se_cl,
        "se_hc3": se_hc3,
        "rho": rho,
        "r2": r2,
    }

    audit_rec = {
        "company": company,
        "gap_usd": gap_dollars,
        "ratio": ratio,
        "ci_lo": ci_lo,
        "ci_hi": ci_hi,
        "gap_lo": gap_lo,
        "gap_hi": gap_hi,
        "verdict": verdict,
    }

    return cdp_rec, audit_rec


cdp_records = []
audit_records = []

for company, grp in df.groupby("companies_name"):
    if len(grp) >= MIN_OBS:
        cdp_rec, audit_rec = analyse_cdp_company(company, grp)
        cdp_records.append(cdp_rec)
        audit_records.append(audit_rec)

cdp_results = pd.DataFrame(cdp_records).sort_values("company").reset_index(drop=True)
audit_results = pd.DataFrame(audit_records).sort_values("company").reset_index(drop=True)

print(f"Number of companies audited: {len(cdp_results)}")
cdp_results.head()
Number of companies audited: 34
company n beta_A se_classical se_hc3 rho r2
0 Allstate Fire & Cas Ins Co 923 0.215017 0.016442 0.018462 1.122903 0.437081
1 Allstate Ind Co 923 0.317649 0.022141 0.024226 1.094157 0.507276
2 American Family Mut Ins Co 923 0.150669 0.012237 0.010640 0.869509 0.502472
3 American Standard Ins Co of WI 923 0.136981 0.013006 0.010412 0.800583 0.430596
4 Country Mut Ins Co 923 0.260289 0.017204 0.020499 1.191543 0.349394

The next table summarises the company-level comparison between classical and HC3 standard errors.

Show CDP standard error summary code
se_summary = pd.DataFrame({
    "Number of companies": [len(cdp_results)],
    "Min HC3/classical SE ratio": [cdp_results["rho"].min()],
    "Mean HC3/classical SE ratio": [cdp_results["rho"].mean()],
    "Max HC3/classical SE ratio": [cdp_results["rho"].max()],
    "Companies with |ratio - 1| > 0.15": [
        (abs(cdp_results["rho"] - 1) > 0.15).sum()
    ]
})

display(se_summary.round(3))
Number of companies Min HC3/classical SE ratio Mean HC3/classical SE ratio Max HC3/classical SE ratio Companies with |ratio - 1| > 0.15
0 34 0.685 1.065 1.775 14

The table below reports selected companies where the HC3 correction has the largest and smallest effect on the standard error for the protected-attribute coefficient.

Show selected CDP standard error results code
selected_se = pd.concat([
    cdp_results.nsmallest(3, "rho"),
    cdp_results.nlargest(3, "rho")
]).drop_duplicates(subset="company")

selected_se = selected_se[[
    "company", "n", "beta_A", "se_classical", "se_hc3", "rho", "r2"
]].copy()

selected_se = selected_se.round({
    "beta_A": 3,
    "se_classical": 3,
    "se_hc3": 3,
    "rho": 3,
    "r2": 3
})

display(selected_se)
company n beta_A se_classical se_hc3 rho r2
30 Trumbull Ins Co 923 0.130 0.012 0.009 0.685 0.405
26 State Farm Fire & Cas Co 923 0.222 0.017 0.013 0.753 0.450
27 State Farm Mut Auto Ins Co 923 0.222 0.017 0.013 0.753 0.450
6 Economy Preferred Ins Co 923 0.308 0.017 0.030 1.775 0.476
18 Metropolitan Cas Ins Co 923 0.308 0.017 0.029 1.679 0.482
9 Farmers Automobile Ins Assoc 923 0.294 0.024 0.040 1.660 0.675
Show CDP standard error ratio plot code
plot_df = cdp_results.sort_values("rho").reset_index(drop=True)

plt.figure(figsize=(8, 4.5))
plt.plot(range(len(plot_df)), plot_df["rho"], marker="o", linestyle="")
plt.axhline(1.0, linestyle="--", linewidth=1)
plt.xlabel("Company, sorted by SE ratio")
plt.ylabel("HC3 SE / classical SE")
plt.title("Classical standard errors can under- or over-estimate uncertainty")
plt.tight_layout()
plt.show()
Figure 1: Ratio of HC3 to classical standard errors for the protected-attribute coefficient.
Note

A ratio above 1 means the classical standard error is smaller than the HC3 standard error. A ratio below 1 means the classical standard error is larger. The direction is not known in advance. This is why the corrected standard error should be computed rather than assumed.

CDP decision rule using TOST

A conventional significance test asks whether the conditional disparity is exactly zero. However, regulatory fairness audits are typically concerned with a different question: whether any disparity is small enough to fall within a pre-specified tolerance.

To address this, we use an equivalence-testing framework known as Two One-Sided Tests (TOST) (Schuirmann 1987). Under TOST, the burden is not to show that the disparity is exactly zero, but to demonstrate that it is sufficiently small to fall within an acceptable range.

Here we use two tolerance conditions:

  1. the confidence interval for the log-premium gap must lie within the log-ratio band corresponding to an adverse-impact ratio of 0.80. And
  2. the implied dollar gap must lie within 5% of the mean premium.

A company passes only if both confidence intervals lie entirely within the tolerance bands. It fails if either interval lies entirely outside. Otherwise the verdict is insufficient information the sample is too imprecise to confirm compliance or a material disparity (Huang and Hooker (2026)).

Show CDP verdict summary code
cdp_summary = (
    audit_results["verdict"]
    .value_counts()
    .rename_axis("Verdict")
    .reset_index(name="Number of companies")
)

display(cdp_summary)
Verdict Number of companies
0 FAIL 34
Show selected CDP verdict results code
selected_cdp = pd.concat([
    audit_results.nsmallest(3, "gap_usd"),
    audit_results.nlargest(3, "gap_usd")
]).drop_duplicates(subset="company")

selected_cdp = selected_cdp[[
    "company", "gap_usd", "ratio", "ci_lo", "ci_hi", "verdict"
]].copy()

selected_cdp = selected_cdp.round({
    "gap_usd": 0,
    "ratio": 3,
    "ci_lo": 3,
    "ci_hi": 3
})

display(selected_cdp)
company gap_usd ratio ci_lo ci_hi verdict
31 USAA Cas Ins Co 18.0 1.095 0.075 0.105 FAIL
33 United Serv Automobile Assn 19.0 1.099 0.078 0.111 FAIL
30 Trumbull Ins Co 28.0 1.138 0.116 0.144 FAIL
1 Allstate Ind Co 231.0 1.374 0.278 0.357 FAIL
19 Metropolitan Grp Prop & Cas Ins Co 162.0 1.323 0.240 0.320 FAIL
18 Metropolitan Cas Ins Co 140.0 1.361 0.260 0.355 FAIL
Show CDP premium gap plot code
plot_df = audit_results.sort_values("gap_usd").reset_index(drop=True)

plt.figure(figsize=(8, 4.5))
plt.plot(range(len(plot_df)), plot_df["gap_usd"], marker="o", linestyle="")
plt.axhline(0.0, linestyle="--", linewidth=1)
plt.xlabel("Company, sorted by estimated gap")
plt.ylabel("Estimated dollar gap")
plt.title("Conditional premium gap for majority-minority zip codes")
plt.tight_layout()
plt.show()
Figure 2: Estimated conditional dollar premium gap across companies.
Note

Unlike conventional significance testing, TOST requires positive evidence that the disparity is sufficiently small. Insufficient information is not a pass. It means the audit cannot yet close and more data or a pre-specified escalation step may be needed.

Proxy Discrimination Audit

Coefficient-shift idea

Proxy discrimination asks whether a rating variable acts as a statistical substitute for the protected attribute. In this illustration, the suspected proxy variable is log_risk.

We compare two regressions:

\text{Restricted model: } \log(P_i) = \mu + \phi \log(\text{risk}_i) + \gamma \text{Chicago}_i + r_i,

\text{Extended model: } \log(P_i) = \mu' + \phi' \log(\text{risk}_i) + \gamma' \text{Chicago}_i + \kappa A_i + r'_i.

The coefficient shift is

\Delta_{PD} = \phi - \phi'.

If the coefficient on log_risk changes meaningfully after adding the protected attribute, this suggests that log_risk was absorbing some protected-group variation in the restricted model.

Standard versus corrected standard errors

The usual independent-samples formula treats the two coefficient estimates as independent. That is not appropriate here because both regressions are fitted to the same deterministic premium vector. The corrected standard error includes the cross-covariance between the two estimates.

Show proxy discrimination analysis code
def analyse_pd_company(company, grp):
    X_cdp, X_res, X_ext, fz, n = build_design_matrices(grp)

    # PD coefficient-shift test for log_risk
    beta_res, _, _ = ols_fit(X_res, fz)
    beta_ext, _, _ = ols_fit(X_ext, fz)

    phi_res = float(beta_res[1])
    phi_ext = float(beta_ext[1])
    shift = phi_res - phi_ext

    var_full, var_ind = cov_shift_full(X_res, X_ext, fz, j=1, k=1)
    se_full = float(np.sqrt(var_full))
    se_ind = float(np.sqrt(var_ind))

    z_ind = shift / se_ind if se_ind > 0 else np.nan
    z_full = shift / se_full if se_full > 0 else np.nan

    flag, rel_pct = pd_verdict(shift, phi_res, z_full)

    pd_rec = {
        "company": company,
        "shift": rel_pct,
        "se_ind": se_ind,
        "se_full": se_full,
        "ratio_se": se_full / se_ind if se_ind > 0 else np.nan,
        "z_ind": z_ind,
        "z_full": z_full,
        "flag": "FLAG" if flag else "",
    }

    return pd_rec


pd_records = []

for company, grp in df.groupby("companies_name"):
    if len(grp) >= MIN_OBS:
        pd_rec = analyse_pd_company(company, grp)
        pd_records.append(pd_rec)

pd_results = pd.DataFrame(pd_records).sort_values("company").reset_index(drop=True)

print(f"Number of companies audited: {len(pd_results)}")
pd_results.head()
Number of companies audited: 34
company shift se_ind se_full ratio_se z_ind z_full flag
0 Allstate Fire & Cas Ins Co 9.681743 0.026227 0.002163 0.082484 0.756896 9.176298
1 Allstate Ind Co 10.485194 0.027659 0.002327 0.084133 1.060300 12.602721 FLAG
2 American Family Mut Ins Co 7.735429 0.026078 0.002129 0.081637 0.533416 6.534027
3 American Standard Ins Co of WI 8.000026 0.028534 0.002315 0.081120 0.443219 5.463756
4 Country Mut Ins Co 10.575810 0.024418 0.002025 0.082924 0.984158 11.868223 FLAG
Show proxy discrimination summary code
pd_summary = pd.DataFrame({
    "Number of companies": [len(pd_results)],
    "Significant using independent SE": [(abs(pd_results["z_ind"]) > Z_CRIT).sum()],
    "Significant using corrected SE": [(abs(pd_results["z_full"]) > Z_CRIT).sum()],
    "Flagged using corrected two-part rule": [(pd_results["flag"] == "FLAG").sum()],
    "Min corrected/independent SE ratio": [pd_results["ratio_se"].min()],
    "Mean corrected/independent SE ratio": [pd_results["ratio_se"].mean()],
    "Max corrected/independent SE ratio": [pd_results["ratio_se"].max()]
})

pd_summary = pd_summary.round(3)
display(pd_summary)
Number of companies Significant using independent SE Significant using corrected SE Flagged using corrected two-part rule Min corrected/independent SE ratio Mean corrected/independent SE ratio Max corrected/independent SE ratio
0 34 0 34 16 0.08 0.082 0.085

The table below shows selected companies with the largest relative coefficient shifts. The final flag requires both statistical significance and a relative shift of at least 10%.

Show selected proxy discrimination results code
selected_pd = pd_results.nlargest(10, "shift")[[
    "company", "shift", "se_ind", "se_full",
    "ratio_se", "z_ind", "z_full", "flag"
]].copy()

selected_pd = selected_pd.round({
    "shift": 1,
    "se_ind": 4,
    "se_full": 4,
    "ratio_se": 3,
    "z_ind": 2,
    "z_full": 2
})

display(selected_pd)
company shift se_ind se_full ratio_se z_ind z_full flag
9 Farmers Automobile Ins Assoc 22.2 0.0253 0.0022 0.085 1.07 12.59 FLAG
6 Economy Preferred Ins Co 16.5 0.0248 0.0021 0.084 1.15 13.71 FLAG
18 Metropolitan Cas Ins Co 15.5 0.0258 0.0022 0.084 1.10 13.20 FLAG
29 Travelers Home & Marine Ins Co 14.6 0.0247 0.0020 0.083 0.78 9.46 FLAG
28 Travelers Commercial Ins Co 14.6 0.0246 0.0020 0.083 0.79 9.51 FLAG
20 Metropolitan Prop & Cas Ins Co 12.5 0.0245 0.0021 0.085 1.34 15.73 FLAG
31 USAA Cas Ins Co 11.5 0.0229 0.0018 0.080 0.36 4.53 FLAG
21 Owners Ins Co 11.4 0.0255 0.0022 0.084 1.14 13.48 FLAG
33 United Serv Automobile Assn 11.1 0.0228 0.0018 0.081 0.38 4.75 FLAG
13 Geico Gen Ins Co 11.1 0.0245 0.0020 0.081 0.62 7.65 FLAG
Show proxy discrimination z-statistics plot code
plot_df = pd_results.sort_values("z_full").reset_index(drop=True)

plt.figure(figsize=(8, 4.5))
plt.plot(range(len(plot_df)), plot_df["z_ind"], marker="o", linestyle="", label="Independent SE")
plt.plot(range(len(plot_df)), plot_df["z_full"], marker="x", linestyle="", label="Corrected SE")
plt.axhline(Z_CRIT, linestyle="--", linewidth=1)
plt.axhline(-Z_CRIT, linestyle="--", linewidth=1)
plt.xlabel("Company, sorted by corrected z-statistic")
plt.ylabel("z-statistic")
plt.title("Correcting the covariance changes the proxy-discrimination test")
plt.legend()
plt.tight_layout()
plt.show()
Figure 3: Proxy-discrimination z-statistics under independent and corrected standard errors.
Note

The proxy-discrimination example shows why the dependence between the two regressions matters. If the cross-covariance is ignored, the standard error of the coefficient shift can be overstated, making the test too conservative.

Appendix: Supporting Materials

Data

Illinois quote data are from the ProPublica car insurance investigation (Larson et al. 2017). We do not redistribute the file here. To obtain it:

  1. Download the ProPublica public data release (multi-state archive).
  2. Extract car-insurance-public/data/il-per-zip.csv.
  3. Save the file in this case study folder as il-per-zip.csv before running the code.

Review ProPublica’s data terms when using the download. Cite (Larson et al. 2017) in any publication that uses these data.

Further reading

  • Huang and Hooker (2026), full inferential theory, audit protocol, and Illinois validation
  • Xin, Hooker, and Huang (2026), how proxy-imputed race distorts regression-based disparity estimates
  • Step 4: Audit the system, qualitative audit pathway for this case study
  • Step 4: Proxied protected attributes, limits when the protected attribute is inferred rather than observed

References

Colorado Division of Insurance. 2023. “CONCERNING QUANTITATIVE TESTING OF EXTERNAL CONSUMER DATA AND INFORMATION SOURCES, ALGORITHMS, AND PREDICTIVE MODELS USED FOR LIFE INSURANCE UNDERWRITING FOR UNFAIRLY DISCRIMINATORY OUTCOMES.” https://drive.google.com/file/d/1BMFuRKbh39Q7YckPqrhrCRuWp29vJ44O/view .
Huang, Fei, and Giles Hooker. 2026. “Fairness Testing for Algorithmic Pricing.”
Larson, Jeff, Julia Angwin, Lauren Kirchner, and Surya Mattu. 2017. “Minority Neighborhoods Pay Higher Car Insurance Premiums Than White Areas with the Same Risk.” ProPublica and Consumer Reports. https://www.propublica.org/article/minority-neighborhoods-higher-car-insurance-premiums-white-areas-same-risk.
MacKinnon, James G, and Halbert White. 1985. “Some Heteroskedasticity-Consistent Covariance Matrix Estimators with Improved Finite Sample Properties.” Journal of Econometrics 29 (3): 305–25.
New York State Department of Financial Services. 2024. “Proposed Insurance Circular Letter: January 17, 2024.” https://www.dfs.ny.gov/industry_guidance/circular_letters/cl2024_nn_proposed.
Schuirmann, Donald J. 1987. “A Comparison of the Two One-Sided Tests Procedure and the Power Approach for Assessing the Equivalence of Average Bioavailability.” Journal of Pharmacokinetics and Biopharmaceutics 15 (6): 657–80.
Xin, Xi, Giles Hooker, and Fei Huang. 2026. “How Proxy Race Distorts Regression-Based Fairness Audits.” https://arxiv.org/abs/2603.17106.